LAB: Linear Regression on Whiteside data

Published

February 13, 2026

M1 MIDS/MFA/LOGOS

Université Paris Cité

Année 2025

Course Homepage

Moodle

Introduction

The purpose of this lab is to introduce linear regression using base R and the tidyverse. We work on a dataset provided by the MASS package. This dataset is investigated in the book by Venables and Ripley; their discussion is worth reading. Our aim is to relate regression as a tool for data exploration with regression as a method in statistical inference. To perform regression, we will rely on the base R function lm() and on the eponymous S3 class lm. We will spend time understanding how the formula argument can be used to construct a design matrix from a dataframe representing a dataset.
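As a tiny preview of what the formula interface does, here is a minimal sketch with a toy data frame (not the Whiteside data):

```r
# A formula plus a data frame yields a design matrix
df <- data.frame(y = c(1, 2, 3), x = c(0.5, 1.5, 2.5))
model.matrix(y ~ x, data = df)  # an intercept column and an x column
```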

Packages installation and loading (again)

Code
# We will use the following packages.
# If needed, install them with pak::pkg_install().
stopifnot(
  require("magrittr"),
  require("lobstr"),
  require("ggforce"),
  require("patchwork"), 
  require("gt"),
  require("glue"),
  require("skimr"),
  require("corrr"),
  require("GGally"),
  require("broom"),
  require("tidyverse"),
  require("ggfortify"),
  require("autoplotly")

)

Besides the tidyverse, we rely on skimr to perform univariate analysis and on GGally::ggpairs to perform pairwise (bivariate) analysis. Package corrr provides graphical tools to explore correlation matrices. At some point, we will showcase the exposing pipe %$% and the classical pipe %>% of magrittr. We use gt to display handy tables and patchwork to compose graphical objects. glue provides a kind of formatted strings. Package broom proves very useful when milking linear models produced by lm() (and many other objects produced by estimators, tests, …)

Dataset

The dataset is available from package MASS. MASS can be downloaded from cran.

Code
whiteside <- MASS::whiteside # no need to load the whole package

cur_dataset <- str_to_title(as.character(substitute(whiteside)))
# ?whiteside

The documentation of R tells us a little bit more about this data set.

Mr Derek Whiteside of the UK Building Research Station recorded the weekly gas consumption and average external temperature at his own house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed. The object of the exercise was to assess the effect of the insulation on gas consumption.

This means that our sample is made of 56 observations. Each observation corresponds to a week during a heating season. For each observation, we have the average external temperature Temp (in degrees Celsius) and the weekly gas consumption Gas. We also have Insul, which tells us whether the observation was recorded Before or After treatment.

Temperature is the explanatory variable or the covariate. The target/response is the weekly Gas consumption. We aim to predict or to explain the variations of weekly gas consumption as a function of average weekly temperature.

The question is whether the treatment (insulation) modifies the relation between gas consumption and external temperature, and, if we conclude that it does, in which way.

Have a glimpse at the data.

Code
whiteside |>
  glimpse()
Rows: 56
Columns: 3
$ Insul <fct> Before, Before, Before, Before, Before, Before, Before, Before, …
$ Temp  <dbl> -0.8, -0.7, 0.4, 2.5, 2.9, 3.2, 3.6, 3.9, 4.2, 4.3, 5.4, 6.0, 6.…
$ Gas   <dbl> 7.2, 6.9, 6.4, 6.0, 5.8, 5.8, 5.6, 4.7, 5.8, 5.2, 4.9, 4.9, 4.3,…

Even though the experimenter, Mr Whiteside, decided to apply a treatment to his house, this is not exactly what we call experimental data: the experimenter has no way to clamp the external temperature. With respect to the Temp variable (the explanatory variable), we are facing observational data.

Columnwise exploration

NoteQuestion

Before proceeding to linear regressions of Gas with respect to Temp, perform a univariate analysis of each variable.

  • Compute summary statistics
  • Build the corresponding plots
TipSolution

skimr does the job. There are no missing data (the complete rate is always 1), so we remove non-informative columns from the output.

Code
whiteside |>
  skimr::skim() |>
  select(-n_missing, -complete_rate, -factor.ordered, - factor.n_unique) |>
  gt() |>
  gt::fmt_number(decimals=1) |>
  gt::tab_caption("Whiteside dataset. Columnwise summaries")
Whiteside dataset. Columnwise summaries
skim_type skim_variable factor.top_counts numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100 numeric.hist
factor Insul Aft: 30, Bef: 26 NA NA NA NA NA NA NA NA
numeric Temp NA 4.9 2.7 −0.8 3.1 4.9 7.1 10.2 ▃▅▇▇▃
numeric Gas NA 4.1 1.2 1.3 3.5 4.0 4.6 7.2 ▁▆▇▂▁

An alternative way of doing this univariate analysis consists in separating categorical variables from numerical ones.

Code
sk <- whiteside %>% 
  skimr::skim() %>% 
  select(-n_missing, - complete_rate, -factor.ordered, - factor.n_unique)

skimr::yank(sk, "factor") |> 
  gt() |>
  gt::tab_caption("Whiteside dataset. Categorical variables summaries")


skimr::yank(sk, "numeric") |> 
  gt() |>
  gt::fmt_number(decimals=1) |>
  gt::tab_caption("Whiteside dataset. Numeric variables summaries")

We may test the normality of the Gas and Temp distributions.

Normality tests for Whiteside dataset
Var statistic p.value method
Temp 0.98 0.48 Shapiro-Wilk normality test
Gas 0.96 0.08 Shapiro-Wilk normality test

Both variables pass the Shapiro-Wilk test with flying colors.
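The table above can be obtained with a pipeline along these lines (a sketch; the exact table-formatting code is not shown in this document):

```r
library(purrr)
library(broom)
library(dplyr)

whiteside <- MASS::whiteside

list(Temp = whiteside$Temp, Gas = whiteside$Gas) |>
  map(shapiro.test) |>    # one htest object per variable
  map(tidy) |>            # turn each htest into a one-row tibble
  bind_rows(.id = "Var")  # stack the rows, keeping the variable names
```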

We may also use the Jarque-Bera test (see R-bloggers on this topic).

Other normality tests for Whiteside dataset
Var statistic p.value parameter method
Temp 1.37 0.50 2.00 Jarque Bera Test
Gas 2.74 0.25 2.00 Jarque Bera Test

Both variables also pass the Jarque-Bera test with flying colors. This is noteworthy since the Jarque-Bera test compares empirical skewness and kurtosis to Gaussian skewness and kurtosis.
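The Jarque-Bera table can be reproduced along the same lines, assuming package tseries provides the test implementation used here:

```r
# Sketch: Jarque-Bera tests, assuming package tseries is installed
library(purrr)
library(broom)
library(dplyr)

whiteside <- MASS::whiteside

list(Temp = whiteside$Temp, Gas = whiteside$Gas) |>
  map(tseries::jarque.bera.test) |>
  map(tidy) |>
  bind_rows(.id = "Var")
```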

Pairwise exploration

NoteQuestion

Compare distributions of numeric variables with respect to categorical variable Insul

TipSolution

We start by plotting histograms

To abide by the DRY principle, we take advantage of the fact that aesthetics and labels can be tuned incrementally.

Code
p_xx <- whiteside |>
    ggplot() +
      geom_histogram(
        mapping=aes(fill=Insul, color=Insul, y=after_stat(density)),
        position="dodge",
        alpha=.1,
        bins=10) 

p_1 <- p_xx + 
  aes(x=Temp)  +
  theme(legend.position = "None")+
  labs(title="External Temperature",
       subtitle = "Weekly Average (Celsius)")

p_2 <- p_xx + 
  aes(x=Gas)  +
  labs(title="Gas Consumption" ,
       subtitle="Weekly Average")

# patchwork the two graphical objects 
(p_1 + p_2) +
  patchwork::plot_annotation(
    caption="Whiteside dataset from package MASS"
  )

Note the position parameter for geom_histogram(). Check the result for position="stack", position="identity". Make up your mind about the most convenient choice.

The Gas consumption distribution After looks shifted with respect to the distribution Before.

Another visualization strategy consists in using the faceting mechanism

Code
r <- whiteside |> 
  pivot_longer(cols=c(Gas, Temp),
              names_to = "Vars",
              values_to = "Vals") |>
  ggplot() +
  aes(x=Vals)  +
  facet_wrap(~ Insul + Vars ) + 
  xlab("")

r +
  aes(y=after_stat(density)) +
  geom_histogram(alpha=.3, fill="white", color="black", bins=6) +
  ggtitle(glue("{cur_dataset} data")) +
  theme_minimal()

Visual inspection suggests that the distributions of Temp before and after insulation are different.

We may further explore this question using linear modeling. The output of function lm() is an object of class lm (a list with attributes). This object can be fed to broom::glance() to obtain a one-row summary of the fit.

Code
lm(Temp ~ Insul, data=whiteside) |>
  broom::glance() |> 
  gt::gt() |>
  gt::fmt_number(decimals=2)  |>
  gt::tab_caption("Linear regression of Temp with respect to Insul, whiteside dataset")
Linear regression of Temp with respect to Insul, whiteside dataset
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.03 0.01 2.74 1.46 0.23 1.00 −134.85 275.70 281.78 404.85 54.00 56.00
Code
lm(Temp ~ Insul, data=whiteside) |>
  anova()  |>
  broom::tidy() |>
  gt::gt() |>
  gt::fmt_number(decimals = 3) |>
  gt::tab_caption("Testing temperature homogeneity over the seasons")
Testing temperature homogeneity over the seasons
term df sumsq meansq statistic p.value
Insul 1.000 10.950 10.950 1.461 0.232
Residuals 54.000 404.855 7.497 NA NA

The difference between temperature distributions over the two seasons is not deemed to be significant by ANOVA.

It is important that the grouping variable (here Insul) is a factor.

Covariance and correlation between Gas and Temp

NoteQuestion

Compute the covariance matrix of Gas and Temp

TipSolution
Code
# Compute the covariance matrix for Gas and Temp
C <- whiteside |>
  select(where(is.numeric)) |> 
  cov()

mu_n <- whiteside %>% 
  select(where(is.numeric)) |>
  colMeans()
Code
matador::mat2latex(round(C,2))

\[ \begin{bmatrix} 7.56 & -2.19\\ -2.19 & 1.36 \end{bmatrix} \]

NoteQuestion
  • Compute correlations (Pearson, Kendall, Spearman) and correlations per group
  • Comment
TipSolution
Code
# use magrittr pipe %>%  to define a pipeline function
my_cor <-  . %>% 
  summarize(
    pearson= cor(Temp, Gas, method="pearson"),
    kendall=cor(Temp, Gas, method="kendall"),
    spearman=cor(Temp,Gas, method="spearman")
  ) 

# even better: pass column names as arguments

my_cor2 <-  function(df, col1, col2){
  df |>
  summarize(
    pearson= cor({{col1}}, {{col2}}, method="pearson"),
    kendall=cor({{col1}}, {{col2}}, method="kendall"),
    spearman=cor({{col1}}, {{col2}}, method="spearman")
  ) 
}

t1 <- whiteside |>
  group_by(Insul) |> 
  my_cor()

# group_by(whiteside, Insul) |> my_cor2(Gas, Temp)

t2 <- whiteside |>
  my_cor() |>
  mutate(Insul="pooled")

# whiteside |> my_cor2(Gas, Temp) |> mutate(Insul="pooled")

bind_rows(t1, t2)  |> 
    gt() |>
    gt::fmt_number(decimals=2) |>
    gt::tab_caption("Whiteside data: correlations between Gas and Temp")
Whiteside data: correlations between Gas and Temp
Insul pearson kendall spearman
Before −0.97 −0.85 −0.96
After −0.90 −0.72 −0.86
pooled −0.68 −0.47 −0.62

Note the sharp increase of all correlation coefficients when the data are grouped according to the control/treatment variable (Insul).

NoteQuestion

Use ggpairs from GGally to get a quick overview of the pairwise interactions.

TipSolution
Code
theme_set(theme_minimal())

whiteside |>
  GGally::ggpairs()

NoteQuestion

Build a scatterplot of the Whiteside dataset

TipSolution
Code
p <- whiteside |> 
  ggplot() +
  aes(x=Temp, y=Gas) +
  geom_point(aes(shape=Insul)) +
  xlab("Average Weekly Temperature (Celsius)") +
  ylab("Average Weekly Gas Consumption 1000 cube feet") +
  labs(
    ## Use list unpacking
  )

p + 
  ggtitle(glue("{cur_dataset} dataset")) +
  theme_minimal()

Note that the dataset looks like the stacking of two bananas corresponding to the two heating seasons.

NoteQuestion

Build boxplots of Temp and Gas versus Insul

TipSolution
Code
q <- whiteside |> 
  ggplot() +
  aes(x=Insul) +
  xlab("") +
  theme_minimal()

qt <- q + 
  geom_boxplot(aes(y=Temp)) 

qg <- q + 
  geom_boxplot(aes(y=Gas)) 


(qt + qg) +
  patchwork::plot_annotation(title = glue("{cur_dataset} dataset"))

Note the two low extremes/outliers for the Gas Consumption after Insulation.

NoteQuestion

Build violin plots of Temp and Gas versus Insul

TipSolution
Code
(q + 
  geom_violin(aes(y=Temp))) +
(q + 
  geom_violin(aes(y=Gas))) +
  patchwork::plot_annotation(title = glue("{cur_dataset} dataset"))

NoteQuestion

Plot density estimates of Temp and Gas versus Insul.

TipSolution
Code
r +
  stat_density(alpha=.3 , 
               fill="grey", 
               color="black", 
#               bw = "SJ",
               adjust = .85 ) +
  ggtitle(glue("{cur_dataset} data")) +
  theme_minimal()

Hand-made calculation of simple linear regression estimates for Gas versus Temp

NoteQuestion

Compute slope and intercept using elementary computations

TipSolution
Code
slope <- C[1,2] / C[1,1] # slope 

intercept <- whiteside %$%  # exposing pipe from magrittr
  (mean(Gas) - slope * mean(Temp)) # intercept

# with(whiteside,
#     mean(Gas) - b * mean(Temp)) 

In simple linear regression, the slope follows from the covariance matrix in a straightforward way. The slope can also be expressed as the linear correlation coefficient (Pearson) times the ratio between the standard deviation of the response variable and the standard deviation of the explanatory variable.
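Both expressions can be checked numerically (a quick sanity check on the pooled data):

```r
# slope = cov(Temp, Gas) / var(Temp) = cor(Temp, Gas) * sd(Gas) / sd(Temp)
with(MASS::whiteside, {
  slope_cov <- cov(Temp, Gas) / var(Temp)
  slope_cor <- cor(Temp, Gas) * sd(Gas) / sd(Temp)
  stopifnot(all.equal(slope_cov, slope_cor))
  slope_cov
})
```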

NoteQuestion

Overlay the scatterplot with the regression line.

TipSolution
Code
p + 
  geom_abline(slope=slope, intercept = intercept) +
  ggtitle(glue("{cur_dataset} data"), subtitle = "Least square regression line")

Using lm()

lm stands for Linear Models. Function lm has a number of arguments, including:

  • formula
  • data
NoteQuestion

Use lm() to compute slope and intercept. Denote the object created by constructor lm() as lm0.

  • What is the class of lm0 ?
TipSolution
Code
lm0 <- lm(Gas ~ Temp, data = whiteside)

The result is an object of class lm.

The generic function summary() has a method for class lm

Code
lm0 %>% 
  summary()

Call:
lm(formula = Gas ~ Temp, data = whiteside)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6324 -0.7119 -0.2047  0.8187  1.5327 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.4862     0.2357  23.275  < 2e-16 ***
Temp         -0.2902     0.0422  -6.876 6.55e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8606 on 54 degrees of freedom
Multiple R-squared:  0.4668,    Adjusted R-squared:  0.457 
F-statistic: 47.28 on 1 and 54 DF,  p-value: 6.545e-09

The summary is made of several parts

  • The call. Very useful if we handle many different models (corresponding to different formulae, or different datasets)
  • A numerical summary of residuals
  • A commented display of the estimated coefficients
  • Estimate of noise scale (under Gaussian assumptions)
  • Squared linear correlation coefficient between response variable \(Y\) (Gas) and predictions \(\widehat{Y}\)
  • A test statistic (Fisher’s statistic) for assessing null hypothesis that slope is null, and corresponding \(p\)-value (under Gaussian assumptions).

Including a raw summary in a report is not always a good idea. It is easy to extract tabular versions of the summary using functions tidy() and glance() from package broom.

For html output gt::gt() allows us to polish the final output

TipSolution

We can use the exposing pipe from magrittr (or the with construct from base R) to build a function that extracts the coefficient estimates, standard errors, \(t\)-statistics and associated p-values.

Code
tidy_lm <- . %$% (     # <1>  The lhs is meant to be of class lm
  tidy(.)  %>%         # <2> . acts as a pronoun for magrittr pipes     
  gt::gt() %>% 
  gt::fmt_number(decimals=2) %>% 
  gt::tab_caption(glue("Linear regression. Dataset: {call$data},  Formula: {deparse(call$formula)}"))  # <3> call is evaluated as a member of the pronoun `.`
)

tidy_lm(lm0)
Linear regression. Dataset: whiteside, Formula: Gas ~ Temp
term estimate std.error statistic p.value
(Intercept) 5.49 0.24 23.28 0.00
Temp −0.29 0.04 −6.88 0.00

deparse() is an important function from base R. It is very helpful when trying to take advantage of lazy evaluation mechanisms.
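A minimal illustration of the substitute()/deparse() idiom used above:

```r
# substitute() captures the unevaluated argument;
# deparse() turns it back into a string
f <- function(x) deparse(substitute(x))
f(Gas ~ Temp)  # "Gas ~ Temp"
```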

NoteQuestion

Function glance() extracts information that can be helpful when performing model/variable selection.

TipSolution

The next chunk handles several other parts of the summary.

Code
glance_lm <-  . %$% (
  glance(.) %>% 
  mutate(across(-c(p.value), 
                ~ round(.x, digits=2)),
         p.value=signif(p.value,3)) %>% 
  gt::gt() %>% 
  gt::tab_caption(glue("Dataset: {call$data},  Formula: {deparse(call$formula)}"))
)

glance_lm(lm0)
Dataset: whiteside, Formula: Gas ~ Temp
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.47 0.46 0.86 47.28 6.55e-09 1 -70.04 146.07 152.15 39.99 54 56
  • r.squared (and adj.r.squared)
  • sigma estimates the noise standard deviation (under the homoscedastic Gaussian noise assumption)
  • statistic is the Fisher statistic used to assess the hypothesis that the slope (Temp coefficient) is zero. It is compared with quantiles of Fisher distribution with 1 and 54 degrees of freedom (check pf(47.28, df1=1, df2=54, lower.tail=F) or qf(6.55e-09, df1=1, df2=54, lower.tail=F)).
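The relation between the F statistic and the p-value can be checked directly from the fitted object (a sketch):

```r
# Recover the p-value reported by glance() from the F statistic
lm0 <- lm(Gas ~ Temp, data = MASS::whiteside)
fstat <- summary(lm0)$fstatistic   # named vector: value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```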
NoteQuestion

R offers a function confint() that can be fed with objects of class lm. Explain the output of this function.

NoteSolution

Under the Gaussian homoscedastic noise assumption, confint() produces confidence intervals for estimated coefficients. Using the union bound, we can derive a conservative confidence rectangle.
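A sketch of the union-bound (Bonferroni) construction: to get a joint level of at least 95% over the two coefficients, request each marginal interval at level 1 - 0.05/2.

```r
lm0 <- lm(Gas ~ Temp, data = MASS::whiteside)
# each side at 97.5%; the rectangle has joint coverage at least 95%
confint(lm0, level = 1 - 0.05 / 2)
```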

Code
M <- confint(lm0, level=.95)  

as_tibble(M) |>
  mutate(Coeff=rownames(M)) |>
  relocate(Coeff) |>
  gt::gt() |>
  gt::fmt_number(decimals=2)
Coeff 2.5 % 97.5 %
(Intercept) 5.01 5.96
Temp −0.37 −0.21
Code
conf_levels <- c(.90, .95, .99)

purrr::map(conf_levels, \(x) confint(lm0, level=x)) |>
  purrr::map(as_tibble) |>
  purrr::map(\(x) mutate(x, coeffs=c("Intercept", "Temp")))|>
  purrr::map(\(x) pivot_longer(x, cols=ends_with("%"), names_to="conf", values_to="bound")) |>
  bind_rows() |>
  gt::gt() |>
  gt::fmt_number(decimals = 2) |>
  gt::tab_caption("Confidence bounds for simple linear regression on whiteside data")
Confidence bounds for simple linear regression on whiteside data
coeffs conf bound
Intercept 5 % 5.09
Intercept 95 % 5.88
Temp 5 % −0.36
Temp 95 % −0.22
Intercept 2.5 % 5.01
Intercept 97.5 % 5.96
Temp 2.5 % −0.37
Temp 97.5 % −0.21
Intercept 0.5 % 4.86
Intercept 99.5 % 6.12
Temp 0.5 % −0.40
Temp 99.5 % −0.18
NoteQuestion

Plot a \(95\%\) confidence region for the parameters (assuming homoscedastic Gaussian noise).

TipSolution

Assuming homoscedastic Gaussian noise, the distribution of \((X^T X)^{1/2}\frac{\widehat{\theta}- \theta}{\widehat{\sigma}}\) is \(\frac{\mathcal{N}(0, \text{Id}_p)}{\sqrt{\chi^2_{n-p}/(n-p)}}\) with \(p=2\) and \(n=56\), where the Gaussian vector and the \(\chi^2\) variable are independent. Hence \[\frac{1}{2}\left\Vert (X^T X)^{1/2}\frac{\widehat{\theta}- \theta}{\widehat{\sigma}} \right\Vert^2= \frac{(\widehat{\theta}-\theta)^T X^TX (\widehat{\theta}-\theta)}{2\widehat{\sigma}^2}\] is distributed according to a Fisher distribution with \(p=2\) and \(n-p=54\) degrees of freedom. Let \(f_{\alpha}\) denote the quantile of order \(1-\alpha\) of the Fisher distribution with \(p=2\) and \(n-p=54\) degrees of freedom; then the following ellipsoid is a \(1-\alpha\) confidence region: \[\left\{ \theta : \frac{(\widehat{\theta}-\theta)^T X^TX (\widehat{\theta}-\theta)}{2\widehat{\sigma}^2} \leq f_{\alpha}\right\}\]

Code
X <- model.matrix(lm0)
coefficients(lm0)["(Intercept)"]
(Intercept) 
   5.486193 
Code
sum(residuals(lm0)^2)
[1] 39.99487

Function conf_region builds a confidence ellipsoid for simple linear regression coefficients based on the design matrix X, the OLS estimated parameters x0, y0, and the estimated noise standard deviation sigma.

Code
conf_region <- function(X, x0=0, y0=0, sigma=1, df1=2, df2=nrow(X)-2, alpha=.05, ...){ 

  falpha <- qf(alpha, df1, df2, lower.tail = F)  # quantile of F distribution 

  specdec <- eigen(t(X) %*% X)
  angle <- acos(specdec$vectors[1,1])
  a <- sigma * sqrt(2*falpha/specdec$values[1])
  b <- sigma * sqrt(2*falpha/specdec$values[2])

  

  ggforce::geom_ellipse(
    aes(
      x0= x0,
      y0= y0,
      a = a,
      b = b,
      angle= angle
    ),
    ...
)
}

Package zeallot provides a form of multiple assignment (unpacking).

Code
require(zeallot)  # experimental 

c(x0,y0) %<-% coefficients(lm0)  
# x0 <- coefficients(lm0)['(Intercept)']
# y0 <- coefficients(lm0)['Temp']

ggplot() +
  conf_region(X, x0, y0, linetype='dashed') +
  conf_region(X, x0, y0, alpha=.01, linetype='dotted') +
  conf_region(X, x0, y0, alpha=.1 ) +
  coord_fixed() +
  theme_minimal()

Diagnostic plots

Method plot.lm() of the generic S3 function plot() from base R offers six diagnostic plots. By default it displays four of them.
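For reference, a minimal base R call (the which argument of plot.lm() selects among the six plots):

```r
lm0 <- lm(Gas ~ Temp, data = MASS::whiteside)
opar <- par(mfrow = c(2, 2))  # 2 x 2 grid for the four default plots
plot(lm0)
par(opar)
```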

In order to obtain diagnostic plots as ggplot objects, use package ggfortify, which defines an S3 method for class lm for the generic function autoplot() (defined in package ggplot2).

NoteQuestion

What are the diagnostic plots good for?

TipSolution

Diagnostic plots for linear regression serve several purposes:

  • Visualization of possible dependencies between residuals and fitted values. OLS theory tells us that residuals and fitted values are (empirically) linearly uncorrelated. This does not preclude other kinds of dependencies.
  • With some overplotting, it is also possible to visualize possible dependencies between residuals and variables that were not taken into account while computing OLS estimates.
  • Assessment of possible heteroscedasticity, for example, dependence of the noise variance on fitted values.
  • Assessing departure of noise distribution from Gaussian setting.
  • Spotting points with high leverage
  • Spotting possible outliers.
NoteQuestion

Load package ggfortify, and call autoplot() (from ggplot2) to build the diagnostic plots for lm0.

Generic function autoplot() called on an object of class lm builds a collection of graphical objects that parallel the output of the base R generic plot() function. The graphical objects output by autoplot() can be further tweaked like any ggplot object.

TipSolution
Code
vanilla <- c(
  "title"= "Diagnostic plot for linear regression  Gas ~ Temp",
  "caption"= "Whiteside data from package MASS"
)
Code
diag_plots <- autoplot(lm0, data=whiteside)

diag_plots is an instance of an S4 class. Check this with sloop::otype(diag_plots).

Code
class(diag_plots)
[1] "ggmultiplot"
attr(,"package")
[1] "ggfortify"

!!! (bang-bang-bang) is provided by the rlang package.

Code
diag_plots@plots[[1]] +
  labs(!!!vanilla)

These diagnostic plots would be much more helpful if we could map Insul to the shape aesthetic.

Code
# diag_plots@plots[[1]]  + aes(shape=Insul)
Code
lm0_augmented <- augment(lm0, data=whiteside) 

lm0_augmented |>
  ggplot() +
  aes(x=.fitted, y=.resid) +
  geom_point(aes(shape=Insul)) +
  geom_smooth(formula='y~x', 
              method="loess", 
              se=T, 
              color="black",
              linewidth=.5) +
  xlab("Fitted values") +
  ylab("Residuals") +
  labs(
    !!!vanilla,
    subtitle = "Residuals versus fitted values  (hand-made)"
  ) +
  theme_minimal()

Overplotting (mapping Insul over shape aesthetics) reveals that residuals do depend on Insul: residual signs are a function of Insul.

The smooth regression line departs from the x-axis, but not in a striking way.

Code
patchwork::wrap_plots(
  (diag_plots@plots[[2]] + coord_fixed() + labs()),  
  (
  augment(lm0, whiteside) |>
  ggplot() +
    aes(y=sort(.std.resid), 
        x=qnorm(seq_along(.std.resid)/length(.std.resid))) +
    geom_point(aes(shape=Insul)) + 
    geom_abline(intercept = 0, slope=1.0111398) +
    geom_smooth(method="lm", se=F, linetype="dotted") +
    geom_abline(intercept=0, slope=4/3) +
    xlab("Normal quantiles") +
    ylab("Standardized residuals") +
    coord_fixed() +
    labs()
  )
) +
patchwork::plot_annotation(
  title =vanilla["title"],
  subtitle="QQ plots for linear regression",
  caption = vanilla["caption"]
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1 row containing non-finite outside the scale range
(`stat_smooth()`).

Code
diag_plots@plots[[3]] +
  geom_smooth(method="lm", se=T, formula='y ~ x', linetype="dotted") +
  labs(
    !!!vanilla,
    subtitle = "Standardized residuals versus fitted values"
  )

The amplitude of standardized residuals seems to be an increasing function of fitted values. This suggests a departure from homoscedasticity.

Two points are singled out (with row index)

Code
diag_plots@plots[[4]] +
  
  labs(
    !!!vanilla,
    subtitle = ""
  )

Double check points with high leverage.

Why points with high leverage and large standardized residuals matter.

The motivation and usage of diagnostic plots is explained in detail in the book by Fox and Weisberg: An R companion to applied regression.

In words, the diagnostic plots serve to assess the validity of the Gaussian linear modelling assumptions on the dataset. The assumptions can be challenged for a number of reasons:

  • The response variable may be a function of the covariates but not a linear one.
  • The noise may be heteroscedastic
  • The noise may be non-Gaussian
  • The noise may exhibit dependencies.

The diagnostic plots are built from the information gathered in the lm object returned by lm(...).

It is convenient to extract the required pieces of information using method augment.lm() of the generic function augment() from package broom.

TipSolution
Code
autoplot(lm0)

Code
whiteside_aug <- lm0 |>
  augment(whiteside)

lm0 %$% ( # exposing pipe !!! 
  augment(., data=whiteside) %>% 
  mutate(across(!where(is.factor), ~ signif(.x, 3))) %>% 
  group_by(Insul) %>% 
  sample_n(5) %>% 
  ungroup() %>% 
  gt::gt() %>% 
  gt::tab_caption(glue("Dataset {call$data},  {deparse(call$formula)}, sampled rows"))
)
Dataset whiteside, Gas ~ Temp, sampled rows
Insul Temp Gas .fitted .resid .hat .sigma .cooksd .std.resid
Before 7.5 4.0 3.31 0.690 0.0344 0.863 0.01190 0.816
Before 2.9 5.8 4.64 1.160 0.0272 0.854 0.02590 1.360
Before 8.5 3.6 3.02 0.581 0.0495 0.865 0.01250 0.692
Before 7.4 4.2 3.34 0.861 0.0332 0.860 0.01780 1.020
Before 7.5 3.9 3.31 0.590 0.0344 0.865 0.00869 0.698
After 0.8 4.6 5.25 -0.654 0.0578 0.864 0.01880 -0.783
After 7.2 2.8 3.40 -0.597 0.0309 0.865 0.00790 -0.704
After 5.0 3.6 4.04 -0.435 0.0179 0.867 0.00237 -0.510
After 4.7 3.5 4.12 -0.622 0.0179 0.864 0.00486 -0.730
After 1.6 4.2 5.02 -0.822 0.0437 0.861 0.02180 -0.977

Recall that in the output of augment()

  • .fitted: \(\widehat{Y} = H \times Y= X \times \widehat{\beta}\)
  • .resid: \(\widehat{\epsilon} = Y - \widehat{Y}\) residuals, \(\sim (\text{Id}_n - H) \times \epsilon\)
  • .hat: diagonal coefficients of Hat matrix \(H\)
  • .sigma: the estimated residual standard deviation obtained when the corresponding observation is dropped from the fit (leave-one-out)
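The first identities above can be verified numerically (a sketch using base R accessors):

```r
lm0 <- lm(Gas ~ Temp, data = MASS::whiteside)
X <- model.matrix(lm0)

# .fitted = X %*% beta_hat ; .hat = diagonal of H = X (X^T X)^{-1} X^T
h_diag <- rowSums((X %*% solve(crossprod(X))) * X)
stopifnot(
  all.equal(unname(fitted(lm0)), drop(X %*% coef(lm0)), check.attributes = FALSE),
  all.equal(unname(hatvalues(lm0)), h_diag, check.attributes = FALSE)
)
```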

NoteQuestion

Compute the share of explained variance

TipSolution
Code
whiteside_aug %$% {
  1 - (var(.resid)/(var(Gas)))
}
[1] 0.4668366
Code
# with(whiteside_aug,
#   1 - (var(.resid)/(var(Gas)))

NoteQuestion

Plot residuals against fitted values

TipSolution
Code
diag_1 <- whiteside_aug |>
  ggplot() +
  aes(x=.fitted, y=.resid)+
  geom_point(aes(shape= Insul), size=1, color="black") +
  geom_smooth(formula = y ~ x,
              method="loess",
              se=F,
              linetype="dotted",
              linewidth=.5,
              color="black") +
  geom_hline(yintercept = 0, linetype="dashed") +
  xlab("Fitted values") +
  ylab("Residuals") +
  labs(caption = "Residuals versus Fitted")
NoteQuestion

Plot the square root of absolute standardized residuals against fitted values (scale-location plot).

TipSolution
Code
diag_3 <- whiteside_aug %>%
  ggplot() +
  aes(x=.fitted, y=sqrt(abs(.std.resid))) +
  geom_smooth(formula = y ~ x,
              se=F,
              method="loess",
              linetype="dotted",
              linewidth=.5,
              color="black") +
  xlab("Fitted values") +
  ylab("sqrt(|standardized residuals|)") +
  geom_point(aes(shape=Insul), size=1, alpha=1) +
  labs(caption = "Scale location")
NoteQuestion

Hand-made normal qqplot for lm0

TipSolution
Code
diag_2 <- whiteside_aug %>% 
  ggplot() +
  aes(sample=.std.resid) +
  geom_qq(size=.5, alpha=.5) +
  stat_qq_line(linetype="dotted",
              linewidth=.5,
              color="black") +
#  coord_fixed() +
  labs(caption="Residuals qqplot") +
  xlab("Theoretical quantiles") +
  ylab("Empirical quantiles of standardized residuals")
NoteQuestion

Gather the three hand-made diagnostic plots together using patchwork

TipSolution
Code
(diag_1 + diag_2 + diag_3 + guide_area()) + 
  plot_layout(guides="collect") +
  plot_annotation(title=glue("{cur_dataset} dataset"),
                  subtitle = glue("Regression diagnostic  {deparse(lm0$call$formula)}"), caption = 'The fact that the sign of residuals depend on Insul shows that our modeling is too naive.\n The qqplot suggests that the residuals are not collected from Gaussian homoschedastic noise.'
                  )

Taking into account Insulation

NoteQuestion

Design a formula that allows us to take into account the possible impact of Insulation. Insulation may impact the relation between weekly Gas consumption and average external Temperature in two ways. Insulation may modify the Intercept, it may also modify the slope, that is the sensitivity of Gas consumption with respect to average external Temperature.

Have a look at formula documentation (?formula).

TipSolution
Code
lm1 <- lm(Gas ~ Temp * Insul, data = whiteside)

vanilla_lm1 <- vanilla
vanilla_lm1["title"] <- "Diagnostic plot for linear regression  Gas ~ Temp*Insul"
NoteQuestion

Check the design using function model.matrix(). How can you relate this augmented design and the one-hot encoding of variable Insul?

TipSolution
Code
model.matrix(lm1) |>
  as_tibble() |>
  slice_sample(n=5, by=InsulAfter)|>
  gt::gt() |>
  gt::tab_caption(glue("Excerpt of model matrix for linear regression  Gas ~ Temp*Insul"))
Excerpt of model matrix for linear regression Gas ~ Temp*Insul
(Intercept) Temp InsulAfter Temp:InsulAfter
1 8.5 0 0.0
1 7.4 0 0.0
1 7.0 0 0.0
1 3.2 0 0.0
1 3.9 0 0.0
1 3.1 1 3.1
1 3.9 1 3.9
1 2.3 1 2.3
1 4.2 1 4.2
1 4.7 1 4.7

InsulAfter is the result of one-hot encoding of the factor column Insul: Before is encoded by 0 and After by 1.

Temp:InsulAfter is the elementwise product of Temp and InsulAfter.

In one-hot encoding, a categorical column with k levels is mapped to k 0-1 valued columns. An occurrence of level j (1 ≤ j ≤ k) is mapped to j-1 0's, followed by a 1, followed by k-j 0's.
The rowsums of these k columns are all equal to 1, that is, to the intercept column, so the augmented design matrix is column-rank deficient. In order to eliminate this redundancy, one of the k columns is dropped. Here the column with 1's for Before has been dropped.

Note that other encodings (called contrasts in the statistics literature) are possible.
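We can inspect the contrasts attached to a factor; a sketch:

```r
whiteside <- MASS::whiteside
contrasts(whiteside$Insul)  # treatment coding: Before (reference) -> 0, After -> 1
contr.sum(2)                # an alternative: sum-to-zero contrasts
```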

NoteQuestion

Display and comment the part of the summary of lm1 concerning estimated coefficients.

TipSolution
Code
lm1 %>% 
  tidy_lm()
Linear regression. Dataset: whiteside, Formula: Gas ~ Temp * Insul
term estimate std.error statistic p.value
(Intercept) 6.85 0.14 50.41 0.00
Temp −0.39 0.02 −17.49 0.00
InsulAfter −2.13 0.18 −11.83 0.00
Temp:InsulAfter 0.12 0.03 3.59 0.00
TipSolution
Code
bind_rows(
  lm0 |>
    glance() |>
    mutate(model="Gas ~ Temp"),
  lm1 |>
    glance() |>
    mutate(model="Gas ~ Temp*Insul")
) |>
  relocate(model) |>
  gt::gt() |>
  gt::fmt_number(decimals=2) |>
  gt::tab_caption("Comparing two models for Whiteside dataset")
Comparing two models for Whiteside dataset
model r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
Gas ~ Temp 0.47 0.46 0.86 47.28 0.00 1.00 −70.04 146.07 152.15 39.99 54.00 56.00
Gas ~ Temp*Insul 0.93 0.92 0.32 222.33 0.00 3.00 −14.10 38.20 48.33 5.43 52.00 56.00
NoteQuestion

Comment the diagnostic plots built from the extended model using autoplot(). If possible, generate alternative diagnostic plots pipelining broom and ggplot2.

TipSolution
Code
diag_plots_lm1 <- autoplot(lm1, data=whiteside)
Code
diag_plots_lm1@plots[[1]] +
  labs(
    !!!vanilla_lm1,
    subtitle="Residuals versus fitted values"
  )

Code
whiteside_aug1 <-  augment(lm1, whiteside)

whiteside_aug1 |>
  ggplot() + 
  aes(x=.fitted, y=.resid) +
  geom_point(aes(shape=Insul)) +
  geom_smooth(method="loess", formula='y ~ x') +
  labs(
    !!!vanilla_lm1,
    subtitle="Residuals versus fitted values"
  )

The sign of residuals does not look like a function of Insul anymore.

Code
p +
  geom_smooth(formula='y ~ poly(x, 2)', linewidth=.5, color="black",
              linetype="dashed", method="lm", se=FALSE) +
  aes(color=Insul) +
  geom_smooth(aes(linetype=Insul),
              formula='y ~ x', linewidth=.5, color="black", method="lm", se=FALSE) +
  scale_color_manual(values= c("Before"="red", "After"="blue")) +
  geom_abline(intercept = 6.8538, slope=-.3932, color="red") +
  geom_abline(intercept = 6.8538 - 2.13, slope=-.3932 + .1153, color="blue") +
  labs(
    title=glue("{cur_dataset} dataset"),
    subtitle = glue("Regression: {deparse(lm1$call$formula)}")
  )

TipSolution
Code
(diag_1 %+% whiteside_aug1) +
(diag_3 %+% whiteside_aug1) +  
(diag_2 %+% whiteside_aug1) +
  guide_area() +
  plot_layout(guides = "collect") +
  plot_annotation(
    title = glue("{cur_dataset} dataset"),
    subtitle = glue("Regression diagnostics: {deparse(lm1$call$formula)}"),
    caption = 'One possible outlier.\nVisible on all three plots.'
  )

The formula argument defines the design matrix and the Least-Squares problem used to estimate the coefficients.
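As a minimal sketch (on a hypothetical toy data frame, not the lab data), model.matrix() makes the formula-to-design-matrix translation explicit:

```r
# Hypothetical toy data: a numeric x and a two-level factor g
toy <- data.frame(
  y = c(1, 2, 3, 4),
  x = c(0.5, 1.5, 2.5, 3.5),
  g = factor(c("a", "a", "b", "b"))
)

model.matrix(y ~ x, toy)      # columns: (Intercept), x
model.matrix(y ~ x + g, toy)  # adds the dummy column gb
model.matrix(y ~ x * g, toy)  # adds the interaction column x:gb
```

Each change in the formula adds or removes columns of the design matrix, and hence changes the Least-Squares problem actually solved.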

Code
cowplot::plot_grid(
  (diag_1 %+% whiteside_aug1),
  (diag_3 %+% whiteside_aug1),
  (diag_2 %+% whiteside_aug1),
  align="none",
  nrow=3, 
  labels=c('A', 'B', 'C')
)

Code
diag_plots_lm1@plots[[1]] + 
  diag_plots_lm1@plots[[3]] +
  patchwork::plot_annotation(
    title="Gas ~ Temp * Insul",
    # subtitle = "Residuals versus fitted",
    caption="Whiteside data"
  )

Code
diag_plots_lm1@plots[[2]] +
  labs(!!!vanilla_lm1)

Code
diag_plots_lm1@plots[[4]] +
  labs(
    !!! vanilla_lm1
  )

The function model.matrix() again allows us to inspect the design matrix.

TipSolution
Code
model.matrix(lm1) %>% 
  as_tibble() %>% 
  mutate(Insul=ifelse(InsulAfter,"After", "Before")) %>% 
  ungroup() %>% 
  slice_sample(n=5) |>
  gt::gt() |>
  gt::tab_caption(glue("Design matrix for {deparse(lm1$call$formula)}")) 
Design matrix for Gas ~ Temp * Insul
(Intercept) Temp InsulAfter Temp:InsulAfter Insul
1 8.8 1 8.8 After
1 -0.8 0 0.0 Before
1 7.5 1 7.5 After
1 6.3 0 0.0 Before
1 7.4 0 0.0 Before
Code
X <- model.matrix(lm1)

Solving OLS with QR factorization

In order to solve the Least-Squares problem, we have to compute \[\widehat{\beta} = (X^T \times X)^{-1} \times X^T \times Y\] This can be done in several ways.

lm() uses QR factorization.
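As a hedged sketch of what happens under the hood (on simulated data, not the lab dataset), qr() factors the design matrix and qr.coef() solves the resulting triangular system:

```r
set.seed(42)                         # hypothetical simulated data
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = .1)

X <- cbind(1, x)                     # design matrix with an intercept column
beta_qr <- qr.coef(qr(X), y)         # least-squares solution via QR

max(abs(beta_qr - coef(lm(y ~ x)))) # agrees with lm() up to rounding
```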

NoteQuestion

Extract the QR factorization of the design matrix from lm1

  • Check that the columns of Q form an orthonormal family
  • Check that R is upper triangular
  • Compute the Hat matrix from the QR factorization
  • Check that the Hat matrix is the orthogonal projection matrix onto the column space of the design matrix.
TipSolution
Code
Q <- qr.Q(lm1$qr)
R <- qr.R(lm1$qr)  # R is upper triangular 

norm(X - Q %*% R, type="F") # QR Factorization
[1] 1.753321e-14
Code
signif(t(Q) %*% Q, 2)      # Q's columns form an orthonormal family
         [,1]     [,2]     [,3]    [,4]
[1,]  1.0e+00 -1.4e-17  3.1e-17 1.7e-16
[2,] -1.4e-17  1.0e+00 -3.5e-17 1.4e-17
[3,]  3.1e-17 -3.5e-17  1.0e+00 0.0e+00
[4,]  1.7e-16  1.4e-17  0.0e+00 1.0e+00
Code
H <- Q %*% t(Q)             # The Hat matrix 

norm(X - H %*% X, type="F") # H leaves X's columns invariant
[1] 1.758479e-14
Code
norm(H - H %*% H, type="F") # H is idempotent
[1] 7.993681e-16
Code
# eigen(H, symmetric = TRUE, only.values = TRUE)$values
Code
sum((solve(t(X) %*% X) %*% t(X) %*% whiteside$Gas - lm1$coefficients)^2)
[1] 3.075652e-29

Once we have the QR factorization of \(X\), solving the normal equations boils down to inverting a triangular matrix.

Code
sum((solve(R) %*% t(Q) %*% whiteside$Gas - lm1$coefficients)^2)
[1] 2.050287e-29
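In practice one does not even form the inverse of R: backsolve() solves the triangular system directly. A self-contained sketch (refitting lm1 so the snippet runs on its own):

```r
lm1 <- lm(Gas ~ Temp * Insul, data = MASS::whiteside)
Q <- qr.Q(lm1$qr)
R <- qr.R(lm1$qr)

# backsolve() exploits the triangular structure of R
beta <- backsolve(R, t(Q) %*% MASS::whiteside$Gas)
sum((beta - lm1$coefficients)^2)   # essentially zero
```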
NoteQuestion

Compute the Gram matrix of the column vectors of the design matrix for lm1, and its inverse.

TipSolution
Code
#matador::mat2latex(signif(solve(t(X) %*% X), 2))

\[ (X^T \times X)^{-1} = \begin{bmatrix} 0.18 & -0.026 & -0.18 & 0.026\\ -0.026 & 0.0048 & 0.026 & -0.0048\\ -0.18 & 0.026 & 0.31 & -0.048\\ 0.026 & -0.0048 & -0.048 & 0.0099 \end{bmatrix} \]

NoteQuestion

Understand the .fitted column in augment(lm1, whiteside)

TipSolution
Code
sum((predict(lm1, newdata = whiteside) - whiteside_aug1$.fitted)^2)
[1] 0
Code
sum((H %*% whiteside_aug1$Gas - whiteside_aug1$.fitted)^2)
[1] 3.478877e-28
NoteQuestion

Try to understand the computation of .resid in an lm object. Compare .resid with the projection of Gas onto the linear subspace orthogonal to the column space of the design matrix.

TipSolution
Code
sum((whiteside_aug1$.resid + H %*% whiteside_aug1$Gas - whiteside_aug1$Gas)^2)
[1] 3.461127e-28
NoteQuestion

Understand .hat

TipSolution
Code
sum((whiteside_aug1$.hat - diag(H))^2)
[1] 0
NoteQuestion

Understanding .std.resid

  • Estimate noise intensity from residuals
  • Compare with the output of glance()
TipSolution
Code
sigma_hat <- sqrt(sum(lm1$residuals^2)/lm1$df.residual)

lm1 |>
  glance() |>
  gt::gt() |>
  gt::fmt_number(decimals=2) |>
  gt::tab_caption(glue("The hand-made estimate of sigma is {round(sigma_hat,2)}"))
The hand-made estimate of sigma is 0.32
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.93 0.92 0.32 222.33 0.00 3.00 −14.10 38.20 48.33 5.43 52.00 56.00

\[ \widehat{r}_i = \frac{\widehat{\epsilon}_i}{\widehat{\sigma} \sqrt{1 - H_{i,i}}} \]

Code
sum((sigma_hat * sqrt(1 -whiteside_aug1$.hat) * whiteside_aug1$.std.resid - whiteside_aug1$.resid)^2)
[1] 4.471837e-28
NoteQuestion

Understanding column .sigma

TipSolution

Column .sigma contains the leave-one-out estimates of \(\sigma\): whiteside_aug1$.sigma[i] is the estimate of \(\sigma\) obtained by leaving out the i-th row of the dataframe.

There is no need to recompute everything for each sample element.

\[ \widehat{\sigma}^2_{(i)} = \widehat{\sigma}^2 \, \frac{n-p-1- \frac{\widehat{\epsilon}_i^2}{\widehat{\sigma}^2 {(1 - H_{i,i})}}}{n-p-2} \]
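As a hedged check (refitting the model so the snippet is self-contained), the brute-force leave-one-out refits match the .sigma column:

```r
library(broom)

lm1 <- lm(Gas ~ Temp * Insul, data = MASS::whiteside)
aug <- augment(lm1, MASS::whiteside)

# Brute force: refit without row i and read off the residual standard deviation
loo_sigma <- sapply(seq_len(nrow(MASS::whiteside)), function(i) {
  sigma(lm(Gas ~ Temp * Insul, data = MASS::whiteside[-i, ]))
})

sum((loo_sigma - aug$.sigma)^2)   # essentially zero
```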

Appendix

This lab uses constructs of the R language that we have not mentioned so far.
Have a look at the documentation to get acquainted with these important and convenient notions.

S3 classes in R

Generic functions for S3 classes are used extensively in this lab. Functions from broom, ggplot2, and ggfortify are often generic functions.

methods(autoplot) lists the S3 classes for which an autoplot method is defined. Some methods are defined in ggplot2, others like autoplot.lm are defined in extension packages like ggfortify.
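A minimal sketch of S3 dispatch, with a hypothetical generic and class name (not part of the lab):

```r
# A generic and two methods; dispatch is driven by the class attribute
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) cat("no specific method\n")
describe.whiteside_summary <- function(x, ...) cat("a whiteside summary\n")

obj <- structure(list(), class = "whiteside_summary")
describe(obj)        # prints "a whiteside summary"
```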

S4 classes in R

The output of autoplot.lm is an instance of an S4 class.

See also these new trends:

The S7 package is a new OOP system designed to be a successor to S3 and S4. It has been designed and implemented collaboratively by the R Consortium Object-Oriented Programming Working Group, which includes representatives from R-Core, Bioconductor, the tidyverse/Posit, and the wider R community.

tibbles with list columns

We also used tibbles with list columns, that is nested dataframes. Nested dataframes show up in every implementation of dataframes (from R to Spark), see: Tidyr: article on nesting
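A minimal illustration on the Whiteside data (assuming the tidyverse is loaded, as in this lab): tidyr::nest() produces a tibble whose data column is a list of tibbles, one per insulation period.

```r
library(tidyverse)

nested <- MASS::whiteside |>
  group_by(Insul) |>
  nest()

nested                 # 2 rows; the data column holds one tibble per group

# A common use of list columns: one fitted model per group
nested |>
  mutate(fit = map(data, ~ lm(Gas ~ Temp, data = .x)))
```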